perf(validate): SIMD UTF-8 validator + measurement infrastructure (draft for analysis)#50

Closed
membphis wants to merge 14 commits into main from worktree-perf-amd-zen2

@membphis
Collaborator

Status

Draft for performance analysis. Correctness is verified across all paths; the AVX2 SIMD validator's CJK speedup landed at ~14% rather than the 4× target in the spec. Handing off for deeper analysis before deciding the path forward.

Summary

This branch combines two logically related sub-projects (14 commits in total; these would normally be 2 stacked PRs):

Bench infrastructure (commits 2db82d7..151d0ea, 6 commits)

  • CJK chat-completion fixture (benches/fixtures/medium_resp_cjk.json, 60368 B, the same byte count as the existing ASCII fixture)
  • Rust criterion harness (benches/parse_eager.rs) — measures Document::parse_with_options end-to-end on the 4-fixture × 2-mode matrix
  • Makefile split (bench is now composite over bench-rust + bench-lua)
  • CI: cargo build --release --benches step to prevent bit-rot
  • README / docs / CLAUDE.md updates

SIMD UTF-8 validator (commits c0aaedd..7e77b51, 8 commits)

  • Cross-backend property test (tests/string_validate_crosscheck.rs, proptest 2000 cases) — scalar ≡ avx2 on arbitrary byte sequences
  • 3 bad-UTF-8 reject fixtures + tests (truncated lead, overlong, surrogate)
  • 2 new bench fixtures: mixed-script (37% high-bit) and emoji-heavy (80% high-bit)
  • AVX2 validator rewrite: 3-tier dispatch (ASCII fast-path / Lemire-Keiser `lookup4` / scalar fallback)
  • Two perf fixes during development:
    1. Hoisted lookup4 table broadcasts + de-duplicated carry-prefix extraction
    2. Fused `vpor + vpmovmskb` to recover ASCII fast-path performance (24% regression initially)

Bench results (Zen 2 AMD EPYC-Rome, 4 vCPU)

Numbers compared against the pre-SIMD baseline saved at commit `f7f07af`:

| Bench | Pre-SIMD | New | Δ | Spec target | Pass |
|---|---|---|---|---|---|
| `parse/ascii/eager` | 16.4 GiB/s | 16.4 GiB/s | −0.9% | ≤ 5% regression | yes |
| `parse/cjk/eager` | 1.07 GiB/s | 1.19 GiB/s | +14.3% | ≥ 4× (4.4 GiB/s) | no |
| `parse/mixed/eager` | 1.15 GiB/s | 1.16 GiB/s | ~0% | ≥ 2× | no |
| `parse/emoji/eager` | 1.16 GiB/s | 1.22 GiB/s | +6.3% | ≥ 2× | no |
| `*/lazy` (4 benches) | ~40 GiB/s | ~40 GiB/s | ±5% | ±3% | ⚠️ within noise |

ASCII passes. CJK / mixed / emoji eager paths improve, but far below the 4× / 2× / 2× targets in the spec.

What we know about the gap

A debug investigation turned up the following (see `tests/string_validate_crosscheck.proptest-regressions` for the captured regression seed that helped pin down the algorithm):

  1. Cross-lane permute cost on Zen 2. `lookup4` uses 3 `_mm256_permute2x128_si256` per chunk for the prev1/prev2/prev3 shifts. Zen 2's cross-lane permute has ~3 cycle latency and 1 cycle throughput. In total, a Tier-2 chunk costs roughly 26 SIMD ops.

  2. Scalar fallback is faster than expected on uniform CJK content. Branch predictor handles the regular 3-byte sequence pattern well; effective ~1 cycle/byte. The "scalar baseline is slow" assumption that justifies SIMD validation isn't holding up on this specific workload + CPU.

  3. The 24% ASCII regression (now fixed) was traced to the dispatch losing the compiler's mask-fusion optimization. Pre-SIMD code used the signed-cmpgt trick to detect ctrl + high-bit bytes in one `vpmovmskb`; the new 3-tier dispatch broke that by needing `high` separated from `ctrl|bs`. The fix in `7e77b51` restores the fusion via `vpor(cb_v, chunk_raw)` before the single movemask, with a second movemask only on the slow path.
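The mask fusion in item 3 can be modelled in portable Rust. This is only a scalar sketch of the dispatch logic, not the AVX2 code from the branch: `Tier`, `movemask`, `ctrl_or_backslash` and `classify_chunk` are hypothetical names, "control byte" is assumed to mean a byte below 0x20, and the `prev_ended_ascii` carry interlock is left out.

```rust
/// Which tier a 32-byte chunk is routed to (illustrative names).
#[derive(Debug, PartialEq, Eq)]
pub enum Tier {
    AsciiSkip,      // Tier 1: printable ASCII only
    Lookup4,        // Tier 2: high-bit bytes, no control byte or backslash
    ScalarFallback, // Tier 3: control byte or backslash present
}

/// Scalar model of `vpmovmskb`: bit i of the result is the top bit of byte i.
fn movemask(v: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in v.iter().enumerate() {
        mask |= ((b >> 7) as u32) << i;
    }
    mask
}

/// Model of the `cb_v` compare result: 0xFF where the byte is a control
/// byte (assumed: < 0x20) or a backslash, 0x00 elsewhere.
fn ctrl_or_backslash(chunk: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (o, &b) in out.iter_mut().zip(chunk.iter()) {
        if b < 0x20 || b == b'\\' {
            *o = 0xFF;
        }
    }
    out
}

pub fn classify_chunk(chunk: &[u8; 32]) -> Tier {
    let cb_v = ctrl_or_backslash(chunk);

    // Fused test: OR the compare result with the raw bytes. The top bit of
    // each fused byte is set iff the byte is ctrl, backslash, or high-bit,
    // so a single movemask answers "is anything interesting here?".
    let mut fused = [0u8; 32];
    for i in 0..32 {
        fused[i] = cb_v[i] | chunk[i];
    }
    if movemask(&fused) == 0 {
        return Tier::AsciiSkip;
    }

    // Slow path only: a second movemask separates Tier 2 from Tier 3.
    if movemask(&cb_v) == 0 {
        Tier::Lookup4
    } else {
        Tier::ScalarFallback
    }
}
```

In this model a pure-ASCII chunk pays for one mask extraction, and only chunks that leave the fast path pay for the second one.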

What the analyst might look at

  • Is the lookup4 inner loop actually executing for CJK chunks, or is Tier 3 (scalar fallback) firing more than expected? — `perf stat` would tell us, but the bench is in a VM.
  • Cross-lane permute alternatives: can we avoid `_mm256_permute2x128_si256` and instead use `_mm_alignr_epi8` on two 128-bit halves? Within-lane ops are 1 cycle on Zen 2.
  • Different algorithm: `std::str::from_utf8`'s SWAR inner loop, hyperscan-style 2-state DFA SIMD, simdjson's "lookup3" variant, etc.
  • Is the bench's `Document::parse_with_options` measuring what we think? Maybe Phase 1 scanner is itself the bottleneck on CJK, not the validator.
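To make the lane-local idea concrete, here is a scalar model of what the prev1/prev2/prev3 shifts compute and how they decompose into two 16-byte `alignr` steps. It is a sketch for reasoning about correctness only: `alignr16`, `prev_n_lane_local` and `prev_n_reference` are hypothetical helpers, nothing here uses real intrinsics, and it says nothing about actual cycle counts.

```rust
/// Scalar model of `_mm_alignr_epi8(hi, lo, imm)`: concatenate `hi:lo`
/// (with `lo` in the low 16 bytes), shift right by `imm` bytes, keep the
/// low 16 bytes. Only valid for imm <= 16 in this model.
fn alignr16(hi: &[u8; 16], lo: &[u8; 16], imm: usize) -> [u8; 16] {
    let mut cat = [0u8; 32];
    cat[..16].copy_from_slice(lo);
    cat[16..].copy_from_slice(hi);
    let mut out = [0u8; 16];
    out.copy_from_slice(&cat[imm..imm + 16]);
    out
}

/// Reference definition: out[i] is the byte `n` positions before cur[i],
/// reaching back into the previous chunk for the first `n` bytes.
pub fn prev_n_reference(prev: &[u8; 32], cur: &[u8; 32], n: usize) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = if i >= n { cur[i - n] } else { prev[32 + i - n] };
    }
    out
}

/// Same result built from two 16-byte steps, each of which only needs its
/// own half and the 16-byte half immediately before it.
pub fn prev_n_lane_local(prev: &[u8; 32], cur: &[u8; 32], n: usize) -> [u8; 32] {
    assert!(n >= 1 && n <= 3);
    let mut prev_hi = [0u8; 16];
    let mut cur_lo = [0u8; 16];
    let mut cur_hi = [0u8; 16];
    prev_hi.copy_from_slice(&prev[16..]);
    cur_lo.copy_from_slice(&cur[..16]);
    cur_hi.copy_from_slice(&cur[16..]);

    let mut out = [0u8; 32];
    out[..16].copy_from_slice(&alignr16(&cur_lo, &prev_hi, 16 - n));
    out[16..].copy_from_slice(&alignr16(&cur_hi, &cur_lo, 16 - n));
    out
}
```

Whether the two-halves form is actually faster on Zen 2 would still need to be measured.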

Correctness verification

  • Cross-check property test: 2000 proptest cases pass (scalar ≡ avx2 on arbitrary byte sequences)
  • Bad-UTF-8 reject fixtures: 3/3 pass (truncated, overlong, surrogate)
  • Boundary tests: 2/2 pass (3-byte lead at chunk boundary cases)
  • Full `cargo test --release`: 297 tests across 18 binaries, all pass
  • Scalar-only build (`--no-default-features`): all pass
  • `cargo clippy --release --all-targets -- -D warnings`: clean
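For readers unfamiliar with the cross-check pattern, a dependency-free sketch follows. It is not the proptest-based test in `tests/string_validate_crosscheck.rs`: the xorshift generator, the byte pool, and the hand-written `validate_scalar` are all assumptions made for illustration, and `std::str::from_utf8` stands in as the oracle.

```rust
/// Hypothetical scalar validator used as the "candidate" backend.
pub fn validate_scalar(b: &[u8]) -> bool {
    let mut i = 0;
    while i < b.len() {
        // (sequence length, allowed range of the second byte)
        let (len, lo, hi) = match b[i] {
            0x00..=0x7F => {
                i += 1;
                continue;
            }
            0xC2..=0xDF => (2, 0x80, 0xBF),
            0xE0 => (3, 0xA0, 0xBF), // excludes overlong 3-byte forms
            0xED => (3, 0x80, 0x9F), // excludes UTF-16 surrogates
            0xE1..=0xEC | 0xEE..=0xEF => (3, 0x80, 0xBF),
            0xF0 => (4, 0x90, 0xBF), // excludes overlong 4-byte forms
            0xF1..=0xF3 => (4, 0x80, 0xBF),
            0xF4 => (4, 0x80, 0x8F), // caps at U+10FFFF
            _ => return false,       // stray continuation or invalid lead
        };
        if i + len > b.len() {
            return false; // truncated sequence
        }
        if b[i + 1] < lo || b[i + 1] > hi {
            return false;
        }
        for k in 2..len {
            if b[i + k] & 0xC0 != 0x80 {
                return false;
            }
        }
        i += len;
    }
    true
}

/// Minimal xorshift64 generator so the sketch needs no external crate.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Returns the first input on which candidate and oracle disagree, if any.
/// `seed` must be non-zero.
pub fn crosscheck(cases: usize, seed: u64) -> Result<(), Vec<u8>> {
    // Bytes chosen to hit lead/continuation edge cases far more often than
    // uniformly random bytes would.
    const POOL: [u8; 16] = [
        0x00, b'a', b' ', 0x80, 0xBF, 0xC0, 0xC2, 0xC3,
        0xE0, 0xE4, 0xED, 0xA0, 0x9F, 0xF0, 0xF4, 0x90,
    ];
    let mut rng = XorShift64(seed);
    for _ in 0..cases {
        let len = (rng.next() % 40) as usize;
        let mut input = Vec::with_capacity(len);
        for _ in 0..len {
            input.push(POOL[(rng.next() % 16) as usize]);
        }
        if validate_scalar(&input) != std::str::from_utf8(&input).is_ok() {
            return Err(input);
        }
    }
    Ok(())
}
```

A failing input returned in `Err` plays the same role as a captured proptest regression seed: it can be pinned as a fixed test case.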

Test plan

  • CI matrix: cargo test (default + scalar-only + test-panic), cargo build --benches, Lua busted, LuaRocks package validation
  • Re-run cross-check at 20K cases (`PROPTEST_CASES=20000 cargo test --release --test string_validate_crosscheck`) before merging out of draft
  • Reproduce bench numbers on a non-virtualized host with `perf` available
  • Decide based on analysis: ship as-is with revised expectations / revert SIMD validator and keep only bench infra / try alternative algorithm

🤖 Generated with Claude Code

membphis added 14 commits May 22, 2026 07:55
Mirrors medium_resp.json byte-for-byte (60368 B) but replaces the
content field with 15000 × "中 " repetitions. The repeated 3-byte BMP
CJK character forces the AVX2 string validator off its ASCII fast path
on every chunk, exposing the scalar fallback cost that the upcoming
SIMD UTF-8 validator targets.

Measures Document::parse_with_options end-to-end across ASCII and CJK
fixtures in both EAGER and LAZY mode (4 benches total). Throughput is
reported in MB/s. The eager-vs-lazy delta per fixture is the
value-level validation cost that future SIMD optimizations target;
the ASCII benches serve as a regression guard.

make bench now runs both Rust criterion and Lua-vs-cjson, matching
the composition pattern used by make test. make bench-rust is the
inner-loop target for SIMD tuning; make bench-lua preserves the
existing user-facing comparison harness behavior unchanged.

make bench is now the composite suite; make bench-lua preserves the
prior Lua-vs-cjson behavior, which is what the benchmarks page
documents. make bench-rust is the new Rust criterion entry point.

cargo build --release --benches catches stale bench source on every
PR without paying the cost (or accepting the non-determinism) of
actually running benchmarks in CI.

Reflects make bench composition + the bench-rust / bench-lua sub-targets,
and lists the new medium_resp_cjk.json fixture under benches/. Caught by
the final review on the bench-infrastructure PR.

proptest property: validate_span_scalar and validate_span_avx2 must
return byte-identical Result for any byte sequence (2000 cases per CI
run). Mirrors tests/scanner_crosscheck.rs pattern. Passes on current
code (AVX2 falls back to scalar on non-ASCII, so trivially identical);
will catch any divergence introduced by the upcoming SIMD UTF-8
validator rewrite.

Three point tests for the three UTF-8 error classes the validator must
reject: truncated multi-byte sequence (0xC3 with no continuation),
overlong encoding (0xC0 0x80 = U+0000 in 2 bytes), and UTF-16 surrogate
(0xED 0xA0 0x80 = U+D800). Tests pass on the current scalar fallback;
they guard against regressions in the upcoming SIMD validator rewrite.
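
The three byte sequences named above can be checked against the standard library to see how each error class surfaces. This sketch uses `std::str::from_utf8` as a stand-in for the project's validator; `Reject` and `reject_reason` are hypothetical names, and the project's own error type may well distinguish the cases differently.

```rust
use std::str::from_utf8;

/// Coarse classification of a UTF-8 failure (illustrative only).
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    /// Input ended in the middle of a multi-byte sequence.
    Truncated,
    /// A byte that can never appear at this position.
    InvalidSequence,
}

pub fn reject_reason(bytes: &[u8]) -> Result<(), Reject> {
    match from_utf8(bytes) {
        Ok(_) => Ok(()),
        // `error_len() == None` means the input ended unexpectedly.
        Err(e) if e.error_len().is_none() => Err(Reject::Truncated),
        Err(_) => Err(Reject::InvalidSequence),
    }
}
```

With this classification, `0xC3` alone is reported as truncated, while the overlong `0xC0 0x80` and the surrogate `0xED 0xA0 0x80` are both rejected at their first byte pair as invalid sequences.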
medium_resp_mixed.json: 60368 B, content cycles "中 hello é world 😀 "
(24 B/cycle × 2500 = 60000 B), exercising 1/2/3/4-byte UTF-8 sequences
with frequent script transitions.

medium_resp_emoji.json: 60368 B, content is "😀 " × 12000 (5 B/cycle),
exercising the 4-byte UTF-8 lookup4 path under maximum pressure (80%
high-bit ratio).

Same skeleton and total byte count as medium_resp{,_cjk}.json so the
bench's MB/s numbers are directly comparable across all four fixtures.
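
The per-cycle byte counts and high-bit ratios quoted for the fixtures can be re-derived with a few lines of Rust. `high_bit_stats` is a hypothetical helper, and the cycle strings are written with escapes on the assumption that "é" is the precomposed U+00E9 (which is what the stated 24 B/cycle implies).

```rust
/// Returns (total bytes, bytes with the high bit set) for one repeat cycle.
pub fn high_bit_stats(cycle: &str) -> (usize, usize) {
    let total = cycle.len(); // `len` counts UTF-8 bytes, not chars
    let high = cycle.bytes().filter(|b| (b & 0x80) != 0).count();
    (total, high)
}

/// Fixture repeat cycles as described in the commit messages.
pub const CJK_CYCLE: &str = "\u{4E2D} "; // "中 "
pub const MIXED_CYCLE: &str = "\u{4E2D} hello \u{E9} world \u{1F600} ";
pub const EMOJI_CYCLE: &str = "\u{1F600} "; // "😀 "
```

This gives 3 of 4 bytes for the CJK cycle, 9 of 24 (37.5%) for the mixed cycle, and 4 of 5 (80%) for the emoji cycle; multiplying the cycle lengths by 15000, 2500 and 12000 repetitions yields 60000 B of content in each case.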
Adds parse/mixed/{eager,lazy} and parse/emoji/{eager,lazy} entries.
Mixed exercises script-transition cost in the validator; emoji
exercises 4-byte UTF-8 pressure. Lazy benches confirm scanner path is
content-agnostic (all four should land within ±3% of each other).
Numbers from this commit form the pre-simd baseline against which the
upcoming AVX2 validator rewrite will be measured.

Replaces the AVX2 fast-path-and-fallback layer with a three-tier
dispatch:
  Tier 1: pure printable ASCII chunks skip wholesale (unchanged).
  Tier 2: pure UTF-8 chunks (no control/backslash) run Lemire/Keiser
          lookup4 — 5 SIMD ops/chunk, no scalar fallback.
  Tier 3: chunks with control byte or backslash flush the lookup4
          carry, then hand off to the scalar state machine.

Carry state across chunks: prev_input (256-bit) feeds shift-with-carry
into lookup4's prev1/prev2/prev3 inputs. err_acc accumulates per-chunk
errors; checked at chunk boundary before Tier 3 handoff and at end of
main loop. prev_ended_ascii flag is the safety interlock for Tier 1.

Lookup tables transcribed verbatim from simdjson's
utf8_lookup4_algorithm.h (Lemire & Keiser, 2020). Correctness verified
by tests/string_validate_crosscheck.rs (2000 proptest cases vs scalar
oracle) and three explicit bad-UTF-8 reject fixtures.
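
The carry idea (an incomplete 1-3 byte sequence at the end of one chunk must be completed by the next) can be illustrated without any SIMD. The sketch below is not lookup4 and not the code in this commit: it validates 32-byte chunks with `std::str::from_utf8` and carries the unfinished prefix forward in a small buffer. `validate_chunked` and `CHUNK` are hypothetical names.

```rust
const CHUNK: usize = 32;

/// Validates `input` chunk by chunk, carrying an incomplete trailing
/// sequence (at most 3 bytes) into the next chunk.
pub fn validate_chunked(input: &[u8]) -> bool {
    let mut carry: Vec<u8> = Vec::with_capacity(CHUNK + 3);
    for chunk in input.chunks(CHUNK) {
        carry.extend_from_slice(chunk);
        // Drop the borrowed &str so `carry` can be mutated below.
        let status = std::str::from_utf8(&carry).map(|_| ());
        match status {
            Ok(()) => carry.clear(),
            Err(e) => match e.error_len() {
                // A definitely invalid byte: reject immediately.
                Some(_) => return false,
                // Input ended mid-sequence: keep only the unfinished
                // prefix and let the next chunk complete it.
                None => {
                    let keep_from = e.valid_up_to();
                    carry.drain(..keep_from);
                }
            },
        }
    }
    // Anything still carried at the end is a truncated sequence.
    carry.is_empty()
}
```

A 3-byte character whose lead byte is the last byte of a chunk exercises exactly the boundary case that the boundary tests in this PR target.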
proptest captured the failing seed [0x00 × 30, 0x80, 0x00] that
surfaced during the AVX2 lookup4 development — initial transcription
of simdjson's tables had errors that this input revealed within
seconds of running the cross-check. Committing the seed ensures the
exact bug-pattern is replayed on every future run.

Three code-quality fixes from review of the AVX2 lookup4 commit:

- lookup4_chunk no longer re-broadcasts BYTE1_HIGH / BYTE1_LOW /
  BYTE2_HIGH on every call; the three table vectors are computed
  once at the top of validate_span_avx2_impl and passed in.
- prev_ended_ascii's safety role at the Tier 1 skip is now documented
  inline (was only in the module-level doc).
- The 1-3 byte carry-prefix extraction from prev_input was duplicated
  in the Tier 3 fallback and the post-loop tail; extracted into a
  private extract_carry_prefix helper so both sites share one source
  of truth.

No behavior change; cross-check property test still passes 2000 cases.

The three-tier dispatch in 47b8fb5 broke the compiler's ability to
collapse the high/ctrl/bs masks into a single vpmovmskb, regressing
parse/ascii/eager by ~24% on Zen 2 (port-0-only vpmovmskb at 1/cycle
was the bottleneck).

Fix: in the inner loop, build a 'cb_v = ctrl|bs' vector first, then
detect any interesting byte (ctrl|bs|high) via vpor(cb_v, chunk_raw)
followed by a single vpmovmskb. Only on the slow path do we compute
a second movemask on cb_v to disambiguate Tier 2 (pure UTF-8) from
Tier 3 (control byte or backslash).

ASCII inner loop: back to 1 vpmovmskb per chunk. CJK / mixed / emoji
paths unchanged (they take the slow path on every chunk, but the
single extra movemask there is dwarfed by lookup4's cost).

Cross-check property test still passes 2000 cases against scalar.
@coderabbitai

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.


@membphis
Collaborator Author

Performance analysis summary: the 2-4× CJK/mixed/emoji targets are unattainable with SIMD UTF-8 validation alone.

Root cause

LAZY mode (~40 GiB/s on all fixtures) is the hard upper bound for EAGER. The structural scan + depth check alone costs ~1.5 μs for 60 KB. EAGER mode spends ~54.5 μs in validate_eager_values, but UTF-8 validation is only a fraction of that — the grammar state machine walking indices and memory access are fixed costs that no validator can eliminate.
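
The fixed-cost argument is an instance of Amdahl's law, which a short sketch can make concrete. The validator's share of EAGER time was not measured in this PR, so the 0.5 used in the usage note is purely a hypothetical input, and `overall_speedup` is an illustrative helper rather than anything in the codebase.

```rust
/// Amdahl's law: end-to-end speedup when only a fraction of the runtime
/// (`validator_fraction`, in 0.0..=1.0) is accelerated by
/// `validator_speedup`.
pub fn overall_speedup(validator_fraction: f64, validator_speedup: f64) -> f64 {
    1.0 / ((1.0 - validator_fraction) + validator_fraction / validator_speedup)
}
```

If, hypothetically, UTF-8 validation were half of EAGER time, a 4× faster validator would yield `overall_speedup(0.5, 4.0)` = 1.6× end to end, and even an infinitely fast validator would cap out at 2×.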

On uniform CJK, the scalar validator already runs at ~1 cycle/byte. The CPU achieves this via perfect branch prediction (every byte is a 3-byte lead → same code path every time) and OOO execution running multiple iterations in flight. SIMD lookup4 does ~26 ops per 32 bytes ≈ 1.6 cycles/byte on Zen 2 after accounting for 3× cross-lane permutes (_mm256_permute2x128_si256, 3 cycle latency each). The SIMD path has more instructions per byte than scalar on this workload — it cannot realistically exceed 1.3-1.5×.

What could still be squeezed out

| Optimization | Est. gain | Notes |
|---|---|---|
| Lane-local 128-bit ops instead of permute2x128 | +10-15% | `_mm_alignr_epi8` is 1 cycle vs 3 |
| CJK-uniform fast path (skip lookup4) | +5-10% | Detect all-lead+continuation pattern |
| Total ceiling | +35-45% | Including the 14% already achieved |

The 4× spec target was based on an assumption that scalar is "slow" — but on uniform non-ASCII data, scalar is actually very close to optimal due to branch prediction + OOO. There is no architectural path to 2-4×.

Recommendation

Keep the bench infrastructure commits (valuable) and revert or shelve the SIMD validator commits. If CJK parse throughput is a priority, the real leverage (roughly 40×) is LAZY mode (~40 GiB/s): validate on access rather than eagerly.

@membphis
Collaborator Author

Closing — performance ceiling fully characterized. See analysis in the comment above.

@membphis membphis closed this May 22, 2026
@membphis membphis deleted the worktree-perf-amd-zen2 branch May 22, 2026 15:30